[nexus] FM alert requests and rendezvous task #9552

Open
hawkw wants to merge 22 commits into main from eliza/fm-alerts

Conversation

@hawkw hawkw (Member) commented Dec 19, 2025

This branch builds on #9492 by adding alert requests to fault management cases. This is a mechanism to allow a sitrep to specify that a set of alerts should exist. UUIDs and payloads for these alerts are specified in the sitrep.

We don't want entries in the alert table to be created immediately when a sitrep is inserted, as that sitrep may not be made current. If the alert dispatcher operated on alerts created in sitreps that were not made current, it could dispatch duplicate or spurious alerts. Instead, we indirect the creation of alert records by having the sitrep insertion create alert requests; if that sitrep becomes current, a background fm_rendezvous task reconciles the requested alerts in the sitrep with the actual alert table. Eventually, this task will be responsible for updating other rendezvous tables based on the current sitrep.
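The request/reconcile indirection can be sketched with in-memory maps standing in for the database tables. This is a hypothetical model, not the actual Nexus code: `AlertRequest`, `reconcile`, and the map layouts are invented for illustration, and the real fm_rendezvous task would do this with CRDB queries rather than in memory.

```rust
use std::collections::HashMap;

/// Hypothetical model of an alert request recorded by a sitrep: the sitrep
/// specifies the alert's UUID (a u64 here, standing in for a real Uuid) and
/// its payload.
struct AlertRequest {
    id: u64,
    payload: String,
}

/// Reconcile the alerts requested by the *current* sitrep into the alert
/// table. Requests belonging to sitreps that were never made current are
/// ignored, and insert-if-absent keeps repeated runs from duplicating
/// alerts, so the task is idempotent.
fn reconcile(
    current_sitrep: u64,
    requests: &HashMap<u64, Vec<AlertRequest>>, // sitrep id -> its requests
    alerts: &mut HashMap<u64, String>,          // alert id -> payload
) -> usize {
    let mut created = 0;
    for req in requests.get(&current_sitrep).into_iter().flatten() {
        if !alerts.contains_key(&req.id) {
            alerts.insert(req.id, req.payload.clone());
            created += 1;
        }
    }
    created
}
```

In this model, requests made by a sitrep that never becomes current simply never materialize as alert rows, which is the property the indirection exists to provide.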

I also did a bit of refactoring of the alert class types so that the structured enum of alert classes could be used by the sitrep.

This change was originally factored out from #9346, but I ultimately ended up rewriting a lot of it.

@hawkw hawkw requested review from davepacheco and smklein December 19, 2025 21:57
@hawkw hawkw self-assigned this Dec 19, 2025
@hawkw hawkw added the nexus (Related to nexus) and fault-management (Everything related to the fault-management initiative (RFD480 and others)) labels Dec 19, 2025
@hawkw hawkw changed the title [nexus] fm alert requests and rendezvous task [nexus] FM alert requests and rendezvous task Dec 19, 2025
@smklein smklein assigned smklein and unassigned hawkw Jan 5, 2026
@smklein smklein (Collaborator) left a comment

Overall structure looks good, but I've got some questions about lifetimes of things

let class = class.into();
match self
    .datastore
    .alert_create(&opctx, id, class, payload.clone())
@smklein (Collaborator):

I don't think we're deleting alert records yet - AFAICT, we're marking them dispatched, but leaving rows in CRDB for them - but when we do, this will be something we need to consider.

  • Suppose we want to delete an alert record from cockroachdb
  • Suppose there is a really laggy Nexus somewhere, running this rendezvous task. It's stuck doing rendezvous for a very old sitrep.
  • If we do "actual SQL DELETE" of the alert, this background task could theoretically bring it back to life (which would be a bug)

I don't think this problem has been totally solved for blueprints, either - I'm not seeing such guards in reconcile_blueprint_rendezvous_tables - but from a discussion with @jgallagher, the priority there was lower, because the rendezvous tables for blueprints are much lower-churn than they presumably will be for alerts.

I wrote up an issue for this on the blueprint side with #9592 , but I think it'll be relevant here much sooner, especially as each alert is injecting an arbitrary JSON payload, which means the table is going to grow in size more quickly.
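The resurrection hazard described above can be modeled in a few lines. Everything here is hypothetical (in-memory maps instead of CRDB, and a tombstone set as just one possible shape for the guard); the point is only that an unguarded insert-if-absent, replayed by a lagging Nexus working from an old sitrep, recreates a deleted alert, while a guarded insert does not.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical model of the rendezvous task's insert. `deleted` stands in
/// for some durable record of deletions (e.g. tombstones) that a guarded
/// INSERT could consult; `guarded` toggles the guard so the race is visible.
fn rendezvous_insert(
    alert_id: u64,
    alerts: &mut HashMap<u64, &'static str>,
    deleted: &HashSet<u64>,
    guarded: bool,
) {
    if guarded && deleted.contains(&alert_id) {
        // The alert was deliberately deleted; do not bring it back to life.
        return;
    }
    alerts.entry(alert_id).or_insert("payload");
}
```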

@hawkw (Member, Author):

You're correct that the alert records are currently never deleted. I think we do need to figure out a strategy for that (see #8076). For the near term, I think we ought to add a nullable case_id column to the alert record, so that alerts created by FM alert requests can point back to the FM case that owns them - we might then use that to determine whether FM alerts are safely deletable, by checking whether that case is still in the current sitrep, or something. WDYT?

@smklein (Collaborator):

Makes sense to me - do we have something similar for webhook_delivery_attempts? I'm noticing that schema references the alert record, and presumably would want to stop if the case was closed?

@hawkw (Member, Author):

I don't think we would want to stop attempting delivery if a case is closed, necessarily. For instance, we might be trying to deliver an alert for the resolution of an Active Problem, which closes the case.

@hawkw (Member, Author):

I was mostly just thinking that we would avoid deleting alerts which were requested by a case that still exists in the current sitrep as a way to avoid accidentally re-creating those alerts in reconciliation.

@smklein (Collaborator):

hrmmm so in #9592 I mentioned a couple possible solutions to avoid re-creating alerts. I figured we were leaning towards option (1) - "guard the INSERT" operation - which would attempt to prevent re-creating alerts if our state is out-of-date. How would that work with a new case_id column on the alert table?

When we're doing reconciliation for a sitrep, we load cases, and generate alerts based on those cases, right? So from the point-of-view of a really slow Nexus acting on old data, it thinks there is a valid case associated with the alert, no? Or are you suggesting "only insert the alert if there exists a corresponding fm_case in the database which appears open"?

@hawkw (Member, Author):

Oh, I was imagining that the case ID column would be used to guard against deleting an alert if it was requested by a case that is still open. Sorry if that was unclear!
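A hedged sketch of that delete-side guard, with invented names: the proposed nullable case_id column is modeled as an `Option<u64>`, and in-memory collections stand in for the alert table and the set of cases in the current sitrep. The real check would of course live in a datastore query, not Nexus memory.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical delete guard: refuse to delete an alert whose owning case is
/// still present in the current sitrep. Alerts with no owning case (a NULL
/// case_id, e.g. alerts not created by FM) are not protected by this guard.
fn try_delete_alert(
    alert_id: u64,
    alerts: &mut HashMap<u64, Option<u64>>, // alert id -> owning case id
    current_sitrep_cases: &HashSet<u64>,
) -> bool {
    let owner = alerts.get(&alert_id).copied();
    match owner {
        // Owning case still in the current sitrep: deletion is unsafe,
        // because reconciliation could legitimately want this alert to exist.
        Some(Some(case_id)) if current_sitrep_cases.contains(&case_id) => false,
        // Owning case gone (or no owning case): safe to delete.
        Some(_) => {
            alerts.remove(&alert_id);
            true
        }
        // No such alert.
        None => false,
    }
}
```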

@smklein (Collaborator):

Ah, so we'd still need a guard for the INSERT in that case, right? The code as-is operates on a sitrep that has been loaded into a particular Nexus's memory - so it might actually be creating an alert for a fully closed case, unless we have some way to stop it in create_requested_alerts.

@hawkw (Member, Author):

Yeah, that's correct. I still need to work out how I want to do that.
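One hypothetical shape for that INSERT guard, again with in-memory stand-ins and invented names: only create the alert if its owning case still appears open at insert time. In the real system this check would need to happen transactionally in the database (e.g. as a conditional insert), not against a Nexus's possibly stale in-memory sitrep, which is exactly the open question above.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical INSERT guard: a Nexus acting on a stale sitrep still ends
/// up consulting the *current* set of open cases at insert time, so it
/// cannot create an alert for a fully closed case.
fn create_requested_alert(
    alert_id: u64,
    case_id: u64,
    open_cases: &HashSet<u64>,      // cases open in the database *now*
    alerts: &mut HashMap<u64, u64>, // alert id -> owning case id
) -> bool {
    if !open_cases.contains(&case_id) {
        // The owning case closed since this sitrep was loaded; skip it.
        return false;
    }
    alerts.entry(alert_id).or_insert(case_id);
    true
}
```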

@smklein smklein assigned hawkw and unassigned smklein Jan 5, 2026
Co-authored-by: Sean Klein <sean@oxide.computer>
@smklein smklein (Collaborator) left a comment

LGTM modulo the question about alerts - I want to be cautious about merging and adding alerts without having a known back-stop for them. If we don't think we're going to be generating alerts immediately after this is merged, we can be more lenient - I just want to make sure we either fix it or keep an eye on it.


Labels

fault-management (Everything related to the fault-management initiative (RFD480 and others)), nexus (Related to nexus)
